Appendices for lectures on diffusion models
This document accompanies the slides available here
A. The Nice™ property
This property allows us to jump to any timestep $t$ of the forward diffusion process directly from the original image $\mathbf{x}_0$, without having to apply the $t$ intermediate noising steps one by one.

As seen in slide 14, each forward diffusion step is defined as:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t\,\mathbf{I}\right)$$

Now, let’s define two additional variables that will become handy. We will define:

$$\alpha_t := 1 - \beta_t$$

And

$$\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$$

With that in mind, now let’s express a forward step in terms of $\alpha_t$:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{\alpha_t}\,\mathbf{x}_{t-1},\ (1-\alpha_t)\,\mathbf{I}\right)$$

With the reparametrization trick, we can write a sample from this distribution as:

$$\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1}, \qquad \boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

As you may imagine, $\mathbf{x}_{t-1}$ can in turn be written in terms of $\mathbf{x}_{t-2}$:

$$\mathbf{x}_t = \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-2}\right) + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1}$$

To simplify it, we can take advantage of a cool property of Normal distributions: the sum of two independent Normals is another Normal, whose mean is the sum of the means and whose variance is the sum of the variances.

Think for a second: if we apply the reparametrization trick in reverse, the two noise terms above are samples from $\mathcal{N}(\mathbf{0},\ \alpha_t(1-\alpha_{t-1})\,\mathbf{I})$ and $\mathcal{N}(\mathbf{0},\ (1-\alpha_t)\,\mathbf{I})$. Their sum is therefore a sample from $\mathcal{N}(\mathbf{0},\ (1-\alpha_t\alpha_{t-1})\,\mathbf{I})$, so we can merge them into a single noise variable:

$$\mathbf{x}_t = \sqrt{\alpha_t\,\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_t\,\alpha_{t-1}}\,\bar{\boldsymbol{\epsilon}}_{t-2}$$

We can keep on recursively developing $\mathbf{x}_{t-2}$, $\mathbf{x}_{t-3}$ and so on:

$$\mathbf{x}_t = \sqrt{\alpha_t\,\alpha_{t-1}\,\alpha_{t-2}}\,\mathbf{x}_{t-3} + \sqrt{1-\alpha_t\,\alpha_{t-1}\,\alpha_{t-2}}\,\bar{\boldsymbol{\epsilon}}_{t-3}$$

… And we could keep on, until we write everything in terms of only $\mathbf{x}_0$:

$$\mathbf{x}_t = \sqrt{\alpha_t\,\alpha_{t-1}\cdots\alpha_1}\,\mathbf{x}_0 + \sqrt{1-\alpha_t\,\alpha_{t-1}\cdots\alpha_1}\,\boldsymbol{\epsilon}$$

And now we can apply the definition of $\bar{\alpha}_t$:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$$

Finally, applying the reparametrization trick in reverse again we can state that:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right)$$

As seen in the Nice™ property slide, this is what lets us obtain the noisy image at any timestep $t$ in a single computation.
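To make the property tangible, here is a minimal NumPy sketch (not from the original lectures; the linear schedule values are just illustrative) comparing $t$ iterative noising steps against the single-step jump. Both produce samples from the same distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear variance schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

x0 = rng.random(100_000)             # a fake flattened "image"
t = 400

# Iterative forward process: apply t noising steps one by one
x = x0.copy()
for s in range(t):
    eps = rng.standard_normal(x.shape)
    x = np.sqrt(alphas[s]) * x + np.sqrt(1.0 - alphas[s]) * eps

# Nice(TM) property: jump straight from x0 to step t
eps = rng.standard_normal(x0.shape)
x_direct = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

# The two results are different samples, but their statistics match
print(x.mean(), x.std())
print(x_direct.mean(), x_direct.std())
```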
B. Diffusion loss function: ELBO derivation
The DDPM model looks very similar to a Variational Autoencoder if you think of it, except for three little things:
- We can think of a DDPM as a VAE where the forward diffusion process is the VAE encoder, and the reverse diffusion process is the VAE decoder. However, the forward diffusion process in a DDPM does not need to be learned by a neural network: we just set it up as a fixed sequence of noise additions (as we have seen)
- A VAE computes the latent space in a single step, whereas in DDPMs we perform many steps to reach there. However, there is a variant of VAEs that also involves many steps in an almost identical fashion to a DDPM, called MHVAEs (Markovian Hierarchical Variational Autoencoders). In fact, both forward and reverse processes in DDPM are Markovian, meaning that they follow the Markov property: the state of the noisy image at a specific timestep $t$ only depends on the state of the image in the immediately previous step ($\mathbf{x}_{t-1}$ in the forward process and $\mathbf{x}_{t+1}$ in the reverse process, respectively)
- We can think of the fully noised image $\mathbf{x}_T$ in similar terms to a VAE’s latent space $\mathbf{z}$. However, the latent space in VAEs has smaller dimensions than the original image, whereas in DDPMs the dimensions are the same as in the original image (same image height, width and channels)
Other than that, both models are very similar. Therefore, we can try to express the loss function of a DDPM by building on the VAE one.
Note: There are more ways to get to the same loss function that we will end up with. By using different principles or slightly different definitions, we can still reach the same final equation. I say this because you may find different derivations online or in other books/papers. However, the final loss expression should be the same (or at least equivalent).
B.1. VAE loss
At the end of the day, both in VAEs and in DDPMs we want our final generated image to be as accurate as possible. The most widely used concept to measure how well the generated output matches the training data is the likelihood.
Given a model with learnable weights/parameters $\theta$, the likelihood $p_\theta(\mathbf{x})$ measures how probable the training data $\mathbf{x}$ is according to the model: the higher the likelihood of the training images, the better the model captures the data.

Therefore, our goal with these generative models will be (at least in part) to maximize the likelihood of the training data; or, as is usually done in practice for numerical convenience, its logarithm $\log p_\theta(\mathbf{x})$.
Nevertheless, we came here to generate new images—not to only learn how to perfectly rebuild an already existing one. Therefore, we will add some additional terms to the loss function to encourage generative properties (instead of the pure reconstruction quality measured by the likelihood).
That’s why the VAE loss function includes an additional term: the Kullback-Leibler (KL) Divergence between the learned latent space distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$ and a prior $p(\mathbf{z})$ that we choose to be a Standard Normal $\mathcal{N}(\mathbf{0}, \mathbf{I})$, so that we can easily sample from the latent space later on.

The composition of these two terms gives us the VAE loss function to minimize, which is the maximization of a quantity usually known as ELBO (Evidence Lower BOund), even though it is also known as VLB (Variational Lower Bound). We won’t discuss here why it is called this way or why it is usually expressed as an inequality:

$$\log p_\theta(\mathbf{x}) \;\geq\; \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] \;-\; D_{KL}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)$$

There is a lot to digest here. First and foremost:

- $\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right]$ is the (log) likelihood of the VAE’s decoder output (that generates images given a sampled instance $\mathbf{z}$ from the latent space), which measures how well the decoder is able to create images that look as if they came from the original data. Hence, this is the reconstruction term (don’t worry for now about the expectation $\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}$; it just means averaging over all possible values of $\mathbf{z}$)
- $D_{KL}\!\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)$ is the Kullback-Leibler Divergence between the encoder’s output (the data in the latent space $\mathbf{z}$ given a training data image $\mathbf{x}$) and $p(\mathbf{z})$, which is the prior for the latent space. We set this prior to a Standard Normal in the VAE, as stated earlier. This divergence is always $\geq 0$ (all KL divergences are), and the lower the better (because it would mean that the latent space resembles a Standard Normal, from which we can easily sample). Therefore, this is the prior matching term
So, to train a VAE we try to maximize the ELBO. In order to do so, we will maximize the reconstruction term and minimize the prior matching term at the same time.
That is the “final” expression for the VAE ELBO. However, to better understand DDPM’s ELBO it will be useful to work with a more generic expression for the VAE’s ELBO. To do so, we can work our way back. First, we can recall a general definition of the KL divergence:

$$D_{KL}\!\left(q(\mathbf{z}) \,\|\, p(\mathbf{z})\right) = \mathbb{E}_{q(\mathbf{z})}\!\left[\log \frac{q(\mathbf{z})}{p(\mathbf{z})}\right]$$

Which can be used to “undo” the KL divergence in the VAE loss and write the whole ELBO as a single expectation:

$$\text{ELBO} = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z} \mid \mathbf{x})}\right]$$
This is the expression we will start from in order to compute the ELBO for the DDPM model.
B.2. DDPM loss
To start working on the DDPM loss, we just have to adapt the ELBO we have for the VAE to suit the model differences between VAEs and DDPMs, as we discussed earlier. Therefore, we will introduce the following changes:
- $\mathbf{z}$ doesn’t really exist as a latent space in DDPMs. Or rather, we could say that in DDPMs we have many latent spaces: the image with each of the different noise levels, $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$. Therefore $p_\theta(\mathbf{x}, \mathbf{z})$, the joint distribution of $\mathbf{x}$ and $\mathbf{z}$, will become $p_\theta(\mathbf{x}_{0:T})$ (the joint distribution over all different noise states for an image)
- $q_\phi(\mathbf{z} \mid \mathbf{x})$ was the encoder output in the VAE, but in DDPM it is the forward diffusion process. As we stated earlier, the forward process does not have any learnable parameters since it is not a learned process; therefore, we can fully drop $\phi$ from the notation. Furthermore, compared to a VAE the goal now is to produce the noisy versions $\mathbf{x}_1, \ldots, \mathbf{x}_T$ of the image starting from $\mathbf{x}_0$. Hence, $q_\phi(\mathbf{z} \mid \mathbf{x})$ will now become $q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)$ for our DDPM.
With this in mind, the ELBO for the DDPM model becomes:

$$\log p_\theta(\mathbf{x}_0) \;\geq\; \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\!\left[\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right]$$

As seen in slide 17, the reverse diffusion process is a Markov chain that starts from the fully noisy image and denoises it one step at a time, so its joint distribution factorizes as:

$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T)\,\prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$$

Similarly, for the forward process we have a similar chain, in this case in the form of:

$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$

Substituting both chains into the ELBO expression we had:

$$\text{ELBO} = \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\!\left[\log \frac{p(\mathbf{x}_T)\,\prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{\prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}\right]$$
We will now develop the products slightly further, since it will become handy later. We will decouple the first term ($t=1$) from the product in the numerator:

$$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T)\; p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\,\prod_{t=2}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$$

And the same for the product in the denominator:

$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = q(\mathbf{x}_1 \mid \mathbf{x}_0)\,\prod_{t=2}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$

Including these two back in the ELBO:

$$\text{ELBO} = \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\!\left[\log \frac{p(\mathbf{x}_T)\; p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\,\prod_{t=2}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_1 \mid \mathbf{x}_0)\,\prod_{t=2}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}\right]$$

That $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ in the denominator is about to give us trouble: if we kept developing the expression as it is, we would end up with expectations taken over two random variables at once, which are harder to handle.

Instead, we will be able to get rid of that ugly expectation by using a simple idea: we can instead re-write $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ as $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0)$. Both are exactly the same distribution (because of the Markov property, once we know $\mathbf{x}_{t-1}$, also knowing $\mathbf{x}_0$ adds nothing), but the second form will let us apply Bayes’ theorem in a very convenient way.

With that in mind, we will substitute that term in the denominator of the ELBO:

$$\text{ELBO} = \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\!\left[\log \frac{p(\mathbf{x}_T)\; p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\,\prod_{t=2}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_1 \mid \mathbf{x}_0)\,\prod_{t=2}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0)}\right]$$
Now let’s take advantage of the fact that the logarithm of a product is a sum of logarithms, and split the expression into more manageable pieces:

$$\text{ELBO} = \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\!\left[\log \frac{p(\mathbf{x}_T)\; p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}{q(\mathbf{x}_1 \mid \mathbf{x}_0)} + \sum_{t=2}^{T}\log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0)}\right]$$

We will now apply Bayes’ theorem to the denominator in the sum, as we anticipated before:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0) = \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\; q(\mathbf{x}_t \mid \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}$$

Which splits each summand into two ratios:

$$\text{ELBO} = \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\!\left[\log \frac{p(\mathbf{x}_T)\; p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}{q(\mathbf{x}_1 \mid \mathbf{x}_0)} + \sum_{t=2}^{T}\log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} + \log\prod_{t=2}^{T} \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)}\right]$$

And we can simplify the last product by writing out a few of its factors:

$$\prod_{t=2}^{T} \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)} = \frac{q(\mathbf{x}_1 \mid \mathbf{x}_0)}{q(\mathbf{x}_2 \mid \mathbf{x}_0)} \cdot \frac{q(\mathbf{x}_2 \mid \mathbf{x}_0)}{q(\mathbf{x}_3 \mid \mathbf{x}_0)} \cdots \frac{q(\mathbf{x}_{T-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_T \mid \mathbf{x}_0)}$$

You can see the clear pattern, right? Every denominator cancels out with the numerator of the next factor, so the whole product collapses to $\frac{q(\mathbf{x}_1 \mid \mathbf{x}_0)}{q(\mathbf{x}_T \mid \mathbf{x}_0)}$. Knowing this, the $q(\mathbf{x}_1 \mid \mathbf{x}_0)$ terms also cancel out, and the ELBO becomes:

$$\text{ELBO} = \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\!\left[\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) + \log\frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T \mid \mathbf{x}_0)} + \sum_{t=2}^{T}\log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}\right]$$

With this, we end up with three expectation terms. All three are expectations over the full trajectory $q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)$, but each of them only really involves one or two of the noisy images; all the other variables simply marginalize out.

Therefore, we can write each expectation only over the variables that actually appear inside it:

$$\text{ELBO} = \mathbb{E}_{q(\mathbf{x}_1 \mid \mathbf{x}_0)}\!\left[\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\right] + \mathbb{E}_{q(\mathbf{x}_T \mid \mathbf{x}_0)}\!\left[\log\frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T \mid \mathbf{x}_0)}\right] + \sum_{t=2}^{T}\mathbb{E}_{q(\mathbf{x}_{t-1}, \mathbf{x}_t \mid \mathbf{x}_0)}\!\left[\log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}\right]$$

Now, the second and third terms look very similar to KL Divergences. We can therefore apply the definition of the KL divergence in reverse. For the second term, it is immediate:

$$\mathbb{E}_{q(\mathbf{x}_T \mid \mathbf{x}_0)}\!\left[\log\frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T \mid \mathbf{x}_0)}\right] = -\,D_{KL}\!\left(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\right)$$
As for the third term, it requires a bit more care: the expectation is taken over the joint distribution $q(\mathbf{x}_{t-1}, \mathbf{x}_t \mid \mathbf{x}_0)$, while the KL divergence we would like to recognize only involves distributions over $\mathbf{x}_{t-1}$.

To better understand how it works, we will convert the expectation to an integral. This can be done by using what is known as the Law of the unconscious statistician or LOTUS (yes: that is the actual name). The LOTUS is as follows:

$$\mathbb{E}_{q(x, y)}\!\left[g(x, y)\right] = \iint g(x, y)\; q(x, y)\; dx\, dy$$

Where:

- $g(x, y)$ is any function involving two random variables $x$ and $y$ (similar to the log-ratio we have above)
- $q(x, y)$ is the pdf (probability density function) of the joint distribution of $x$ and $y$

So we are integrating over all possible values of both random variables, weighting $g$ by how likely each combination of values is.
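As a quick numerical sanity check (not part of the original derivation), here is the single-variable version of the LOTUS in NumPy: the expectation of $g(X)$ can be obtained either by sampling $X$ and averaging $g$, or by integrating $g$ against the pdf of $X$, without ever sampling $g(X)$ itself:

```python
import numpy as np

rng = np.random.default_rng(0)

g = lambda x: x**2                     # any function of the random variable
# X ~ N(0, 1)

# Plain expectation: sample X, apply g, average
samples = rng.standard_normal(1_000_000)
monte_carlo = g(samples).mean()

# LOTUS: integrate g(x) weighted by the pdf of X
x = np.linspace(-8.0, 8.0, 100_001)
pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
integral = np.sum(g(x) * pdf) * (x[1] - x[0])

print(monte_carlo, integral)           # both are close to 1 (the variance of X)
```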
Let’s apply this law to that third expectation term (for a single value of $t$ in the sum):

$$\mathbb{E}_{q(\mathbf{x}_{t-1}, \mathbf{x}_t \mid \mathbf{x}_0)}\!\left[\log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}\right] = \iint \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}\; q(\mathbf{x}_{t-1}, \mathbf{x}_t \mid \mathbf{x}_0)\; d\mathbf{x}_{t-1}\, d\mathbf{x}_t$$

Given that we integrate over both $\mathbf{x}_{t-1}$ and $\mathbf{x}_t$, we need the pdf of their joint distribution given $\mathbf{x}_0$.

Now, the expression $q(\mathbf{x}_{t-1}, \mathbf{x}_t \mid \mathbf{x}_0)$ can be factored using the chain rule of probability: $q(\mathbf{x}_{t-1}, \mathbf{x}_t \mid \mathbf{x}_0) = q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\; q(\mathbf{x}_t \mid \mathbf{x}_0)$.

Substituting this in the integrals:

$$\iint \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}\; q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\; q(\mathbf{x}_t \mid \mathbf{x}_0)\; d\mathbf{x}_{t-1}\, d\mathbf{x}_t$$

Now to the main trick: the “inner” integral (the one over $\mathbf{x}_{t-1}$) weights the log-ratio by the pdf $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, while the “outer” integral (over $\mathbf{x}_t$) weights the result by $q(\mathbf{x}_t \mid \mathbf{x}_0)$.

With this, we can go back to writing expectations instead of integrals:

$$\mathbb{E}_{q(\mathbf{x}_t \mid \mathbf{x}_0)}\!\left[\mathbb{E}_{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}\!\left[\log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)}\right]\right]$$

We can now use again the definition of KL divergence, this time on the inner expectation:

$$-\,\mathbb{E}_{q(\mathbf{x}_t \mid \mathbf{x}_0)}\!\left[D_{KL}\!\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right)\right]$$

This way the third expectation is also greatly simplified.
With that, we can now plug both simplified terms back into the ELBO:

$$\text{ELBO} = \mathbb{E}_{q(\mathbf{x}_1 \mid \mathbf{x}_0)}\!\left[\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\right] - D_{KL}\!\left(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\right) - \sum_{t=2}^{T}\mathbb{E}_{q(\mathbf{x}_t \mid \mathbf{x}_0)}\!\left[D_{KL}\!\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right)\right]$$

And that’s the loss function for the DDPM model! Let’s label each of these three terms: the first one is the reconstruction term, the second one is the prior matching term, and the third one is the denoising matching term.
We can briefly describe these three:
- The reconstruction term is very similar to the VAE one, with the difference that since in the DDPM model we have many denoising steps in the reverse diffusion process, this term only focuses on the last step: going from $\mathbf{x}_1$ to $\mathbf{x}_0$, that is, from the least noisy image to the original one
- The prior matching term represents how close the most noisy image $\mathbf{x}_T$ is to $p(\mathbf{x}_T)$, which, if you remember, we said is just full noise from a Standard Normal $\mathcal{N}(\mathbf{0}, \mathbf{I})$. Since there are no learnable weights/parameters in this term (no $\theta$ involved anywhere), we can safely ignore it during training
- The denoising matching term is the bulk of our loss. It tries to make our model’s learned reverse diffusion process $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ as close as possible to what would be the ground-truth, real reverse diffusion process, as represented by $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$. We will call this ground truth the forward process posterior, and it will be described in detail in Appendix C
To make sure we understand, let’s take a look at the diagram in slide 18:
The bulk of our loss is the denoising matching term, which is the KL divergence between what would be the perfect reverse diffusion process (called forward process posterior) and the one that our model will produce. This means that if our model is able to closely replicate this forward process posterior, we will end up with great quality images! That is why this will be the focus of our procedure.
Sidenote: If you look at the DDPM paper, you will find a similar yet different formula in its Equation (5).

It is actually the same equation as ours, with just a few cosmetic differences:

- The one in the paper is the negative ELBO (therefore it will be minimized instead of maximized), so the signs have flipped
- They change the order of the terms, which obviously does not affect anything at all
- The paper’s notation is more vague when it comes to expressing the expectations. However, it is not important. When training the model, the common way to minimize an expectation is basically through iterating over training samples with stochastic gradient descent many times. Therefore, in the paper they do some abuse of notation and just write a single expectation with a generic subscript $q$
Other than that, it is the exact same formula. So, from now on we will use this equation for the loss.
In the paper they refer to those three terms with the names $L_T$ (prior matching), $L_{t-1}$ (denoising matching, for $t$ ranging from $2$ to $T$) and $L_0$ (reconstruction):

- $L_T$, as stated before, has no learnable parameters: so we don’t care about it and we will just drop it
- $L_{t-1}$ will be our focus
- $L_0$ gets a special treatment in Section 3.3 of the DDPM paper, where they show how it can be handled with a separate decoder. However, it just focuses on what would be the very last denoising step (out of a thousand of them); therefore, its importance is quite negligible ($\mathbf{x}_1$ is almost noiseless already, given that we have set a small enough $\beta_1$). Therefore, the authors decide in Section 3.4 to ignore this term as well
Note: This same derivation can be found in Appendix A (page 13) of the DDPM paper. However, the steps described here go into much more detail than what you will find there. With that said, it is basically the same derivation.
To conclude: all of a sudden, our model will only care about the denoising matching term $L_{t-1}$; the rest of the loss can be safely ignored.
C. The forward process posterior
In Appendix B we concluded that we will focus on the denoising matching term of the loss:

$$D_{KL}\!\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right)$$

If we look closely, we can see that the second term inside the KL divergence, $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, is just our model’s reverse diffusion step: the thing we actually want to learn.

However, the first term, $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, is new: it looks like the forward process, but written in the reverse direction.

Indeed, this is known as the forward process posterior. Let’s forget for a second about the conditioning on $\mathbf{x}_0$ and think about what $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ would mean: given a noisy image $\mathbf{x}_t$, it is the distribution of the slightly less noisy image $\mathbf{x}_{t-1}$ that produced it. In other words, the “perfect” reverse diffusion step.

Unfortunately, this forward process posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ is intractable: computing it would require, in practice, knowing everything about the distribution of our training data.

To solve this problem, we can condition the forward process posterior on $\mathbf{x}_0$: as we will see, once the original image is also known, the posterior becomes something we can compute in closed form.
Now we can continue. The forward process posterior becomes tractable if:

- We condition it on $\mathbf{x}_0$, therefore turning it into $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ (as just discussed), and
- We assume it is also a Normal distribution

To better understand why we can assume it is a Normal distribution, the paper mentions in page 2 that both processes (meaning forward and backward) have the same functional form when the step variances $\beta_t$ are small.

Given that we indeed set small $\beta_t$ values (and a large number of steps $T$), we can safely assume that the forward process posterior is a Normal distribution as well.
So, the distribution we are after is $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, the one that appears in the denoising matching term. Applying Bayes’ rule, we will get:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0)\; q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)}$$
We will handle each of the three terms on the right-hand side separately. Under the assumption that all of those are Normal distributions, we can recall that the probability density function (pdf) of a Normal distribution $\mathcal{N}(\mu, \sigma^2)$ is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

To make it cleaner, let’s rewrite it keeping only the part that depends on $x$ (the normalizing constant in front will not matter for what follows):

$$f(x) \;\propto\; \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
And now let’s develop each term in Bayes’ rule above. For the first one, thanks to the Markov property conditioning additionally on $\mathbf{x}_0$ changes nothing, and we already know that it is the forward diffusion step:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0) = q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \;\propto\; \exp\!\left(-\frac{\left(\mathbf{x}_t - \sqrt{\alpha_t}\,\mathbf{x}_{t-1}\right)^2}{2\,(1-\alpha_t)}\right)$$

Where the relevant part of the pdf is the quadratic expression inside the exponent (the part that will not be the same in the other two Bayes terms).

Second Bayes term in the numerator, which we know thanks to the Nice™ property:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_0) \;\propto\; \exp\!\left(-\frac{\left(\mathbf{x}_{t-1} - \sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0\right)^2}{2\,(1-\bar\alpha_{t-1})}\right)$$

And finally, for the last Bayes term (the denominator), again via the Nice™ property:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) \;\propto\; \exp\!\left(-\frac{\left(\mathbf{x}_t - \sqrt{\bar\alpha_t}\,\mathbf{x}_0\right)^2}{2\,(1-\bar\alpha_t)}\right)$$
With all three terms developed, we can take advantage of the property $\frac{e^{a}\,e^{b}}{e^{c}} = e^{\,a+b-c}$ to combine everything into a single exponential:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \;\propto\; \exp\!\left(-\frac{1}{2}\left[\frac{\left(\mathbf{x}_t - \sqrt{\alpha_t}\,\mathbf{x}_{t-1}\right)^2}{1-\alpha_t} + \frac{\left(\mathbf{x}_{t-1} - \sqrt{\bar\alpha_{t-1}}\,\mathbf{x}_0\right)^2}{1-\bar\alpha_{t-1}} - \frac{\left(\mathbf{x}_t - \sqrt{\bar\alpha_t}\,\mathbf{x}_0\right)^2}{1-\bar\alpha_t}\right]\right)$$

Now we need to grab that huge expression and somehow “re-pack” it into something that also looks like a Normal distribution. In fact, all of that is proportional to the pdf of some Normal distribution (we say proportional because of the constant factors we have been dropping along the way).

You may be wondering about the terms that do not contain $\mathbf{x}_{t-1}$ at all. Remember that the distribution we are after is a distribution over $\mathbf{x}_{t-1}$: both $\mathbf{x}_t$ and $\mathbf{x}_0$ act here as given, fixed values.

Even more: given that anything not involving $\mathbf{x}_{t-1}$ is constant with respect to the variable we care about, those terms can be absorbed into the proportionality constant and safely ignored.
Now, let’s look again at the pdf formula for a Normal distribution (keeping only the exponent):

$$f(x) \;\propto\; \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

It looks kind of similar to our expression above, doesn’t it?

It does. In fact: if we are able to make them match, we will be able to obtain the mean and variance of a Normal distribution whose pdf would be proportional to our expression. Being proportional is enough, as we will perform optimization with it (a constant does not alter the optimization result).

Let’s develop the square in the pdf formula:

$$f(x) \;\propto\; \exp\!\left(-\frac{1}{2}\left[\frac{1}{\sigma^2}\,x^2 - \frac{2\mu}{\sigma^2}\,x + \frac{\mu^2}{\sigma^2}\right]\right)$$

Now they look even more alike! Look closely; if we also expand the squares in our expression and group everything by powers of $\mathbf{x}_{t-1}$, we get:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \;\propto\; \exp\!\left(-\frac{1}{2}\left[\left(\frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar\alpha_{t-1}}\right)\mathbf{x}_{t-1}^2 \;-\; 2\left(\frac{\sqrt{\alpha_t}}{1-\alpha_t}\,\mathbf{x}_t + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}\,\mathbf{x}_0\right)\mathbf{x}_{t-1} \;+\; C(\mathbf{x}_t, \mathbf{x}_0)\right]\right)$$

Where $C(\mathbf{x}_t, \mathbf{x}_0)$ groups everything that does not involve $\mathbf{x}_{t-1}$ (which, as we said, we can ignore).

We can extract a mean and a variance from our expression by matching it against the developed pdf, term by term. Matching the coefficients of $\mathbf{x}_{t-1}^2$ (that is, matching against $\frac{1}{\sigma^2}$):

$$\frac{1}{\tilde\beta_t} = \frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar\alpha_{t-1}}$$

Developing a bit (recall that $1-\alpha_t = \beta_t$, and that $\alpha_t(1-\bar\alpha_{t-1}) + 1 - \alpha_t = 1-\bar\alpha_t$):

$$\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t$$

And that’s it for the variance. We can do the same for the mean, by equating the coefficients of $\mathbf{x}_{t-1}$ (that is, matching against $\frac{2\mu}{\sigma^2}$) and solving for $\mu$:

$$\tilde{\boldsymbol\mu}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,\mathbf{x}_t + \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,\mathbf{x}_0$$
And we already did something very similar back in Appendix A: due to the Nice™ property, we can also express $\mathbf{x}_0$ in terms of $\mathbf{x}_t$ and the noise $\boldsymbol\epsilon$ that produced it:

$$\mathbf{x}_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon\right)$$

Substituting this into the mean above and simplifying, the forward process posterior mean can also be written as:

$$\tilde{\boldsymbol\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon\right)$$

To summarize: we have stated that the forward process posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is also a Normal distribution:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \tilde{\boldsymbol\mu}_t(\mathbf{x}_t, \mathbf{x}_0),\ \tilde\beta_t\,\mathbf{I}\right)$$

Those $\tilde{\boldsymbol\mu}_t$ and $\tilde\beta_t$ are exactly the mean and variance we have just derived, and both can be computed in closed form from $\mathbf{x}_t$, $\mathbf{x}_0$ (or, equivalently, the noise $\boldsymbol\epsilon$) and the variance schedule.

Note: in the DDPM paper it looks like their Equation (7) is written slightly differently, but if you look closely it is exactly the same expression we have just derived; only the notation and the order of the terms change.

To summarize: the forward process posterior is what our model will try to replicate. Therefore, we need to be able to compute it during training so our model can try to learn from it. Through this section we have concluded that the forward process posterior is a Normal distribution whose mean and variance we can compute exactly, using only the noisy image $\mathbf{x}_t$, the original image $\mathbf{x}_0$ (or the added noise) and the variance schedule.
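As a reference, this is a minimal NumPy sketch (our own helper, not code from the lectures or the paper) of how those two quantities can be computed for a given variance schedule:

```python
import numpy as np

def forward_process_posterior(x_t, x_0, t, betas):
    """Mean and variance of q(x_{t-1} | x_t, x_0); t is 1-indexed as in the paper."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    alpha_t = alphas[t - 1]
    alpha_bar_t = alpha_bars[t - 1]
    alpha_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0

    beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * betas[t - 1]
    mu_tilde = (
        np.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * x_t
        + np.sqrt(alpha_bar_prev) * betas[t - 1] / (1.0 - alpha_bar_t) * x_0
    )
    return mu_tilde, beta_tilde
```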
D. Objective & the training procedure
In Appendix B we stated that the training will be focused on the denoising matching term, which is the KL divergence:

$$D_{KL}\!\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right)$$

We want to minimize the KL divergence between those two terms: the “perfect” reverse diffusion (forward process posterior) and the reverse diffusion that our model will learn to do. The first term is the forward process posterior we studied in Appendix C:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \tilde{\boldsymbol\mu}_t,\ \tilde\beta_t\,\mathbf{I}\right)$$

And throughout that appendix we developed the expression, concluding that:

$$\tilde{\boldsymbol\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon\right), \qquad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t$$

For the second term in the KL divergence we have our model’s learned reverse diffusion step:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \boldsymbol\mu_\theta(\mathbf{x}_t, t),\ \boldsymbol\Sigma_\theta(\mathbf{x}_t, t)\right)$$

As described in slide 17. You can see that it is also a Normal distribution, in this case with a mean (and potentially a variance) produced by our model.
Since we want to minimize the KL divergence between these two terms, it now becomes evident that a way to do it is by making the latter’s mean and variance match the former’s ones! Let’s continue, because the authors end up setting an equivalent yet different goal.
Let’s start with the variance. In Section 3.2 of the DDPM paper, the authors decide not to learn it at all: they set $\boldsymbol\Sigma_\theta(\mathbf{x}_t, t) = \sigma_t^2\,\mathbf{I}$, with $\sigma_t^2$ fixed to either $\beta_t$ or $\tilde\beta_t$.

So they give two possibilities/alternatives here. Intuitively, it would make sense to use the second expression (therefore claiming that our learned reverse process has exactly the same variance as the forward process posterior it tries to imitate); however, the authors report that both choices gave similar results in their experiments.

With that said, either option for $\sigma_t^2$ is computed upfront from the variance schedule: the variance is not learned, so our model only has to worry about producing a good mean.
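To get a feel for how little the choice matters, here is a small NumPy comparison (ours, just for illustration) of the two options over the linear schedule used in the paper; beyond the very first steps, $\beta_t$ and $\tilde\beta_t$ are almost identical:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)               # the paper's linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.concatenate(([1.0], alpha_bars[:-1]))

beta_tildes = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas

for i in [0, 1, 9, 99, 499, 999]:                # i = t - 1 (0-indexed)
    print(f"t={i + 1:4d}  beta_t={betas[i]:.6f}  beta_tilde_t={beta_tildes[i]:.6f}")
```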
As for the mean, an obvious choice would be to try to make our model output $\boldsymbol\mu_\theta(\mathbf{x}_t, t)$ directly, so that it matches $\tilde{\boldsymbol\mu}_t$. That would work; however, the authors found a different parametrization that is both simpler and, in their experiments, produced better samples.
In the end, they decided to do the following: they built a model that learns to predict the noise that was added during the forward process. Let’s see how they did it.
Going back to the definition of the denoising matching term, we need to compute a KL divergence between two Normal distributions. We will start by writing down the general closed form of the KL divergence between two multivariate Normals of dimension $d$:

$$D_{KL}\!\left(\mathcal{N}(\boldsymbol\mu_1, \boldsymbol\Sigma_1) \,\|\, \mathcal{N}(\boldsymbol\mu_2, \boldsymbol\Sigma_2)\right) = \frac{1}{2}\left[\log\frac{|\boldsymbol\Sigma_2|}{|\boldsymbol\Sigma_1|} - d + \operatorname{tr}\!\left(\boldsymbol\Sigma_2^{-1}\boldsymbol\Sigma_1\right) + (\boldsymbol\mu_2-\boldsymbol\mu_1)^\top \boldsymbol\Sigma_2^{-1}\,(\boldsymbol\mu_2-\boldsymbol\mu_1)\right]$$

In our case, we can assume the (co)variances of the two Normals to be the same (as discussed earlier in this section) and set them to $\sigma_t^2\,\mathbf{I}$. With equal covariances, the log-determinant ratio is zero and the trace term cancels with $-d$, so only the last term survives:

$$D_{KL} = \frac{1}{2\sigma_t^2}\left\|\boldsymbol\mu_2 - \boldsymbol\mu_1\right\|^2$$

Going back to our case, we can now apply this to the two means we care about:

$$D_{KL}\!\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right) = \frac{1}{2\sigma_t^2}\left\|\tilde{\boldsymbol\mu}_t - \boldsymbol\mu_\theta(\mathbf{x}_t, t)\right\|^2$$

Note: You will find a very similar expression to this one in Equation (8) of the DDPM paper.
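If you want to convince yourself of the equal-variance closed form, here is a tiny Monte Carlo check in NumPy (ours, for illustration) using two one-dimensional Normals:

```python
import numpy as np

rng = np.random.default_rng(0)

mu1, mu2, sigma = 0.3, -0.5, 0.7

# Closed form when both variances are equal: squared distance of means / (2 * sigma^2)
closed_form = (mu1 - mu2) ** 2 / (2 * sigma**2)

# Monte Carlo estimate of KL(N(mu1, s^2) || N(mu2, s^2)) = E_q[log q(x) - log p(x)]
x = rng.normal(mu1, sigma, 2_000_000)
log_q = -0.5 * ((x - mu1) / sigma) ** 2          # normalizing constants cancel out
log_p = -0.5 * ((x - mu2) / sigma) ** 2
monte_carlo = (log_q - log_p).mean()

print(closed_form, monte_carlo)                  # both are around 0.65
```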
Let’s now define a form for what will be the predicted mean $\boldsymbol\mu_\theta(\mathbf{x}_t, t)$. As we saw at the end of Appendix C, thanks to the Nice™ property the target mean can be written in terms of $\mathbf{x}_t$ and the noise $\boldsymbol\epsilon$ added during the forward process:

$$\tilde{\boldsymbol\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon\right)$$

Therefore, we can give the predicted mean the exact same functional form, simply replacing the real noise $\boldsymbol\epsilon$ (which we will not know at sampling time) with a noise prediction $\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)$ produced by a neural network:

$$\boldsymbol\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)\right)$$

To sum up, the authors decided to make a model that predicts the noise $\boldsymbol\epsilon$ that was added to the image during the forward process, instead of predicting the mean directly.

Now, by plugging both means into the squared-distance expression above, the KL divergence becomes:

$$\frac{1}{2\sigma_t^2}\left\|\tilde{\boldsymbol\mu}_t - \boldsymbol\mu_\theta(\mathbf{x}_t, t)\right\|^2 = \frac{\beta_t^2}{2\,\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}\left\|\boldsymbol\epsilon - \boldsymbol\epsilon_\theta(\mathbf{x}_t, t)\right\|^2$$

You will find this same expression in the DDPM paper, right before the authors simplify the objective.

This formula is, literally, the mean squared error between our model’s prediction and the noise added during the forward pass, multiplied by a weighting term that depends only on the variance schedule ($\beta_t$, $\alpha_t$, $\bar\alpha_t$ and $\sigma_t$).

However, the authors state in Section 3.4 that it is both simpler and better for image quality to disregard this weighting term. Hence, the final loss function becomes:

$$L_{\text{simple}} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol\epsilon}\!\left[\left\|\boldsymbol\epsilon - \boldsymbol\epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon,\ t\right)\right\|^2\right]$$

Which is, straight up, the mean squared error between the model’s prediction and the noise added to the image sample.
Therefore, the training process is as simple as it is shown in Algorithm 1 of the DDPM paper:
Which can be summarized as: pick a training image, pick a random value for $t$, sample some Normal noise $\boldsymbol\epsilon$, use the Nice™ property to noise the image up to step $t$, and take a gradient descent step on the squared error between the real noise $\boldsymbol\epsilon$ and the model’s prediction $\boldsymbol\epsilon_\theta$. Repeat until convergence.
For all the complexity of the theory and derivations, we end up with a surprisingly simple formulation for model training.
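For reference, here is a minimal PyTorch-style sketch of one iteration of Algorithm 1 (the `model(x_t, t)` network, the optimizer and the data are placeholders, not the original lecture code):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, optimizer, x0):
    """One iteration of Algorithm 1 for a batch of clean images x0 (shape: B x C x H x W)."""
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,))                        # a random timestep per image
    eps = torch.randn_like(x0)                               # the noise the model must predict

    # Nice(TM) property: jump straight to the chosen noise level
    ab = alpha_bars[t].view(batch, 1, 1, 1)
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps

    loss = F.mse_loss(model(x_t, t), eps)                    # L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```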
E. The sampling procedure
After training, we can generate new images out of random Normal noise. The inference (prediction) procedure is usually known as sampling in diffusion models, and it is implemented in Algorithm 2 of the DDPM paper:
We can summarize it as: generate some pure random noise from a Standard Normal, and then, for $t = T, \ldots, 1$: predict the noise with the model, use it to compute the mean of the reverse step, and sample the slightly denoised image (adding the extra noise $\sigma_t\,\mathbf{z}$ at every step except the last one). After the $T$ steps, the result is our generated image.
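As a companion to that summary, here is a minimal PyTorch-style sketch of Algorithm 2 (again, `model` is a placeholder noise-prediction network; we use the $\sigma_t^2 = \beta_t$ option for the variance):

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    """Algorithm 2: generate images of the given shape starting from pure Normal noise."""
    T = betas.shape[0]
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                                    # x_T: pure noise
    for t in reversed(range(T)):                              # t = T-1, ..., 0 (0-indexed)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = model(x, torch.full((shape[0],), t))            # predicted noise
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * z                   # add sigma_t * z (skipped at the end)
    return x
```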
With this process in mind, you will most likely have two questions:
1. If at any given step $t$ we are using our trained model to predict $\boldsymbol\epsilon_\theta$, which is (an estimate of) the noise we added through the forward process, why don’t we remove it in a single step? Why go through the trouble of doing $T$ steps of progressive denoising?
2. What on Earth is $\sigma_t\,\mathbf{z}$ and why is it there? It is adding back Normal noise to the image in the middle of our denoising process! For all the trouble we went through to learn a model capable of denoising, why are we adding noise back again in order to denoise an image?
Let’s address these two questions. To do so, we will present two answers: the mathematical one and the intuitive one.
E.1. The mathematical answer
This one is pretty short and very straightforward, and will answer both questions at once: it’s just how we have defined the process. It’s very simple:
- We have defined a forward diffusion process $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ where, given some potentially partially noisy image, it returns a slightly noisier image
- We also defined in Appendix C the forward process posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, which does the same in reverse (conditioning on $\mathbf{x}_0$, but that is irrelevant now): given a noisy image it returns a slightly less noisy one
- To have a model that learns to do the same that the forward process posterior does, we defined in Appendix D the reverse diffusion process $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ that does the same as the forward process posterior: given a noisy image it returns a slightly less noisy one

This means that, given how we have defined all the logic, the way to proceed is to get a fully noisy image (or a randomly generated one, which for the model is the same thing) and apply the learned reverse diffusion step $T$ times, removing a little bit of noise at each step.
As seen in Appendix D, the reverse process $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ is a Normal distribution whose mean is predicted by our model and whose variance is fixed to $\sigma_t^2\,\mathbf{I}$.

As with any Normal distribution, we can apply the reparametrization trick as seen in Appendix A to sample from it:

$$\mathbf{x}_{t-1} = \boldsymbol\mu_\theta(\mathbf{x}_t, t) + \sigma_t\,\mathbf{z} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)\right) + \sigma_t\,\mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Now take a look again at step 4 in Algorithm 2. It’s exactly the same! That is because this is literally where it comes from: it is just a matter of applying the reparametrization trick to the learned reverse process.

This uncovers an important fact about how the process has been designed: the forward process posterior is defined between consecutive timesteps only, and it is a distribution we sample from, not a deterministic function.

And the exact same goes for our learned reverse process $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$: by construction it only removes one step’s worth of noise at a time, and sampling from it necessarily involves drawing fresh Normal noise, which is exactly where the $\sigma_t\,\mathbf{z}$ term comes from.
If that answers your questions about why we perform many steps in the sampling algorithm and why we still add back more noise during the process, then good for you. However, even if the math is clear, you may still find it conceptually confusing. Don’t worry: you are not alone. Let’s go through another way to answer those two questions.
E.2. The intuitive answer
Now that we understand all the math behind DDPMs, let’s jump into how the process works conceptually. Arguably the most important intuition is how noising an image works.
During the forward process, we incrementally add Normal noise to the image. As we saw in slide 14, it boils down to generating some Normal noise with a given mean and variance and adding it to the (scaled) image at each step.
To better understand what adding this noise looks like, we will use some data from the Animal Faces-HQ dataset (version 2), a dataset created by Choi et al. [2020] which features many photos of animal faces. We will only use dog faces for this example.
All images in the dataset are high-quality 512x512 photos, which means that data is very high-dimensional: 512 pixels of height x 512 pixels of width x 3 color channels (RGB) = 786,432. So if we were to flatten out the images, these would “live” in a 786,432-dimensional space!
Unfortunately, we cannot represent 786,432 dimensions on a flat screen. Therefore, we will represent the high-dimensional space where our dog faces lie scattered as a 2D plane. Keep in mind that our intuitions about the 2D (or 3D) space don’t fully apply to very high-dimensional spaces (due to the curse of dimensionality, among other issues), but this is still the best we can do.
Note: We would like to thank Sander Dieleman from Deepmind, as the upcoming diagrams were inspired by his article on diffusion guidance.
Here is the space we will work with. We are already showing one of the images in our dataset, which we will call $\mathbf{x}_0$.

All possible dog faces are points in this space. However, most points will be just noisy, nonsensical images instead of recognizable dog faces: realistic dog faces will most likely only be present in very narrow, specific parts of this space (think about the probability of getting a good looking dog face just by randomly selecting RGB values for 512x512 pixels).

Nevertheless, here we have our $\mathbf{x}_0$. Let’s now perform the first forward diffusion step by generating some Normal noise and adding it to the image.
A good way to understand the mean and variance of the generated noise is to think in terms of vectors:
- The mean represents the direction for the vector (the direction in which we jump)
- The variance represents how long (or short) that vector is (how long our jump is)
Since the mean of the noise we generate is zero and its variance is small, the jump will go in a completely random direction, and it will be a short one.

Due to randomness, we ended up jumping up and to the right. You can see that, indeed, the resulting image is just a slightly noisier version of $\mathbf{x}_0$.
Let’s perform another forward diffusion step. Just like in the first one, we will jump in a random direction, but now starting from the point we just landed on.

The image gets noisier and further away from the original. If we perform a third step:

And we could continue doing forward steps up to $t = T$, at which point the image would be indistinguishable from pure noise.

Note: obviously diagrams are only illustrative, and in reality the third diffusion step does not yield an image as noisy as depicted here; the actual process is much more gradual. Furthermore, the variance of the added noise follows the schedule $\beta_t$, so not every jump has exactly the same length.

We also know that due to the Nice™ property we can take shortcuts and go from $\mathbf{x}_0$ to the noisy image at any timestep $t$ in one single, longer jump.
And that’s it for the forward process. Now let’s explain the sampling process, as this will let us answer our two questions about the sampling algorithm.
As explained in the sampling algorithm, to generate new images we will start by generating a completely random image: a sample of pure noise drawn from a Standard Normal distribution.

Most likely we will want our model to generate noiseless, photorealistic dog faces similar to the ones used for training. We will now place in the 2D space the dog face we have shown in the forward process (which we will still call $\mathbf{x}_0$) so we can use it as a reference.

You can see that we were unlucky enough to generate random noise that is far away from what would be a good image, such as $\mathbf{x}_0$.

Unlike in the forward process, where the direction in which we jumped was random, now we have a model that has learned to predict the noise present in the image; and that prediction points in a specific direction.

As you can imagine, our model will not be perfect. The chances of getting the model to predict the perfect, exact direction that would take us straight to a beautiful dog face are basically zero.
With that in mind, let’s recover the first question we had about the sampling algorithm:
1. If at any given step $t$ we are using our trained model to predict $\boldsymbol\epsilon_\theta$, which is (an estimate of) the noise we added through the forward process, why don’t we remove it in a single step? Why go through the trouble of doing $T$ steps of progressive denoising?
To understand why this won’t work well, the best way to proceed is to just try doing it!
We know that the Nice™ property allows us to jump from $\mathbf{x}_0$ to the noisy image at any timestep $t$ in a single step. So, why not use it in reverse and jump from our pure-noise image straight back to a clean one?

We can totally do that. The formula is very simple. If we know that the Nice™ property gives us:

$$\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon$$

The inverse transformation can be obtained by solving for $\mathbf{x}_0$:

$$\mathbf{x}_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(\mathbf{x}_t - \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon\right)$$

Where, since we don’t know the real noise $\boldsymbol\epsilon$, we replace it with our model’s prediction $\boldsymbol\epsilon_\theta(\mathbf{x}_t, t)$, and then jump straight from our randomly generated noise to the predicted clean image.
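For completeness, this is what that single jump computes, as a minimal PyTorch-style sketch (the helper name is ours, for illustration):

```python
import torch

def predict_x0_in_one_jump(model, x_t, t, alpha_bars):
    """Invert the Nice(TM) property in one step, using the model's noise prediction."""
    eps_pred = model(x_t, t)
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    return (x_t - torch.sqrt(1.0 - ab) * eps_pred) / torch.sqrt(ab)
```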
The dashed line shows our model’s predicted direction, and the jump length is basically dictated by the inverse Nice™ property formula itself. We would then end up in a new point; let’s look at the image it corresponds to:

… And it looks horrible. We get nothing but a blurry image that kind of looks like a dog face, but quality is extremely low. How is that possible? The point we landed on is relatively close to our reference image $\mathbf{x}_0$, isn’t it?
To better understand what is going on, let’s add some more dog faces to our space. We have obviously trained our model with many dog faces, and all of them lie somewhere in this space. Let’s include two more in our diagram:
We have added some colored contour areas around each dog to depict the fact that the closer a point is to a dog, the more it will resemble that dog. For instance: a point close to the gray Pitbull at the top will look similar to that gray Pitbull.
Here we can see how our prediction landed somewhere in between several of the training dog faces: reasonably close to all of them, but not really close to any specific one.

As we discussed in Appendix D, our model is trained on the mean squared error between its prediction and the real noise added during the forward process to a partially noisy image. But here we are asking it, starting from pure random noise, to predict the exact noise that would need to be removed to land on one specific dog face; and the information needed to pick one specific dog face simply is not there.
Instead, the model will always choose—due to its lack of information—to predict an averaged, generic dog face. This actually happens anytime we use the mean squared error as a loss function: if you think of a regular univariate linear regression, the prediction is a line that goes kind of in between the training data points. This way the prediction error is never very high; on average, the model always does a relatively good job. But it does not focus on predicting any point particularly better than the rest.
The same is happening here: since the model does not have good information to work with (only random noise), the best it can do is to create a generic dog face; which is basically the average of many dog faces seen during training. The end result is very poor, but our model thinks that it is doing a good job! It is not taking the risk of generating a specific dog face. To the model’s eyes, the prediction it generated is way better than if it perfectly predicted any of the three dog faces we have in the diagram: predicting any of those out of pure random noise would be too risky, and it would violate the principle of “doing good on average, without overfitting” that mean squared error dictates.
This shows the disconnection between how we train our model and the task we want it to perform. We obviously would prefer to obtain any of the three dog faces out of our generative model, instead of a blurry average. These kinds of disconnections are common in generative AI, and they mostly happen because virtually all generative models are just discriminative under the hood—but somewhere in the process we alter their behavior and/or how we use them to obtain generative properties out of them. Other examples of such disconnections are LLM hallucinations or the mode collapse phenomenon that affects GANs.
In order to fix the issue, we will just undergo an iterative process: progressively denoising the image instead of trying to do it in a single step. We will perform many small jumps instead of a single large one, just as we do in the forward process if we don’t use the Nice™ property (see back here). This effectively fools the model: it steers its subsequent predictions so it makes riskier predictions without it knowing. We will see in a moment how this works.
So, instead of trying to jump directly to the predicted clean image, we will only move a small fraction of the way in the predicted direction:
As represented by the black arrow.
Given that we have jumped to a new point, if we now repeat these steps (predict a new direction and perform a small jump), we will end up somewhere new after every iteration:
After any step we end up with a new, slightly denoised image. However, the model doesn’t have memory between jumps, meaning that it does not know that the image used to predict the next jump comes from a previous jump. For the model, a partial “image” predicted out of pure noise might as well be a “real” image that was made partially noisy with the Nice™ property in the forward process; it does not know that it comes from pure random noise. This is effectively how we fool the model and force it to generate more non-generic, risky images: at the beginning of each step we are basically telling the model: assuming that this image is a partially noised version of some real image, tell us which noise you would remove.
In fact, to further encourage the appearance of these “deviations”, we can already answer the second question we had about the sampling algorithm:
2. What on Earth is $\sigma_t\,\mathbf{z}$ and why is it there? It is adding back Normal noise to the image in the middle of our denoising process! For all the trouble we went through to learn a model capable of denoising, why are we adding noise back again in order to denoise an image?
Indeed, we will add noise right after each prediction + jump, which will effectively make us perform a random extra jump in each iteration. Here it is:
So each sampling iteration involves using the model to predict a direction, performing a small jump in that direction, and then performing an extra jump in a random direction, based on some freshly generated Normal noise $\mathbf{z}$ scaled by $\sigma_t$.

That extra noise has the effect of making the sampling process “deviate” even more from the generic, averaged prediction: each sampling run gets pushed towards a different region of the space, which is what ends up producing varied, specific-looking images instead of the same blurry average.
We can now note the following crucial statement: Our goal is not to accurately reconstruct an original image; all we want is to generate good-looking images out of random noise. We train our model to learn to reconstruct images just as a convenient, tractable proxy for the real task we want to perform.
To summarize: it is impossible to generate the perfect, original image out of pure random noise, and that is perfectly fine; it was never the goal. What the iterative process (and the extra noise) gives us is a trajectory that ends up in some region of the space containing a good-looking, specific image.

All that is left is to rinse and repeat for the remaining timesteps. Our model predicts a new direction from the current, partially denoised image:

If we were now to take this predicted direction and jump straight to what would be the updated clean-image prediction, we would already land somewhere noticeably different from the blurry average we obtained at the very first step:
Nevertheless, we will again only perform a small jump in that direction and then add the random extra noise $\sigma_t\,\mathbf{z}$:

Yielding a prediction for the image at the next, slightly less noisy timestep:

From this point on, we can just repeat the same procedure until we have completed all $T$ steps, at which point we obtain our final generated image.
Furthermore, the iterative nature of the sampling algorithm not only helps with getting something other than a blurry, average dog face; it also allows the model to iteratively refine the generated image. During the first iterations the model mostly generates very high-level shapes and base colors (as you can see in the diagram above), whereas the last iterations mostly add and refine the finer details.
And that’s about it. Keep in mind that numerous publications that came after the DDPM paper changed many aspects of the algorithm. For instance:
- Some diffusion variants do not add this extra noise $\sigma_t\,\mathbf{z}$ in the sampling algorithm (such as DDIM by Song et al. [2020])
- Other implementations try to predict the reverse process variance $\boldsymbol\Sigma_\theta$ with a model instead of manually fixing its values upfront (for instance in Nichol & Dhariwal [2021])
- Many variants use different values of $T$ for training and sampling (as seen in the two previously mentioned papers as well; in DDIM they basically “skip” some $t$s during sampling and jump right from one $\mathbf{x}_t$ to an $\mathbf{x}_{t'}$ several steps earlier, for instance)
Alongside many, many other algorithm modifications that have been presented since 2020.
Citation
If you would like to cite this post in an academic context, you can use the following BibTeX snippet:
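@article{soto2024diffappendices,
  title  = "Appendices for lectures on diffusion models",
  author = "Soto, Julio A.",
  year   = "2024",
  month  = "Nov",
  url    = "https://julioasotodv.github.io/ie-c4-466671-diffusion-models/Appendices%20for%20lectures%20on%20diffusion%20models.html"
}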